Exercise 9 – Random Forests

Hint: Useful libraries

```r
# Consider the following libraries for this exercise sheet:
library(proxy)
library(mlr3)
library(rpart.plot)
library(mlr3learners)
library(data.table)
library(mlr3verse)
```
Exercise 1: Bagging
In this exercise, we briefly revisit why bagging is a useful technique to stabilize predictions.
For a fixed observation \((\mathbf{x}, y)\), show that the expected quadratic loss over individual base learner predictions \(b^{[m]}(\mathbf{x})\) is larger than or equal to the quadratic loss of the prediction \(f^{[M]}(\mathbf{x})\) of a size-\(M\) ensemble. You can consider all hyperparameters of the base learners and the ensemble fixed.
Hint
Use the variance identity (“Verschiebungssatz der Varianz”): \(\mathsf{Var}(Z) = \mathbb{E}(Z^2) - (\mathbb{E}(Z))^2 ~ \Longleftrightarrow ~ \mathbb{E}(Z^2) = \mathsf{Var}(Z) + (\mathbb{E}(Z))^2\), where \(\mathsf{Var}(Z) \geq 0\) by definition. It implies \(\mathbb{E}(Z^2) \geq (\mathbb{E}(Z))^2\) for any random variable \(Z\).
Solution
- Start with the LHS of the inequality: the expected quadratic loss over individual base learner predictions \(\rightsquigarrow\) \(\mathbb{E}\left( \left(y - b^{[m]}(\mathbf{x})\right)^2 \right)\)
- Get clear about RHS: the quadratic loss of the prediction of a size-\(M\) ensemble \(\rightsquigarrow\) \(\left( y - \left(f^{[M]}(\mathbf{x})\right) \right)^2\)
- Before we get busy moving from LHS to RHS, let’s think about the expectation for a minute.
Expected value of a RV: (possibly infinite) sum over all values the RV could take, weighted by the probability of observing that value.
What even is the RV here? Since the base learner and ensemble structure is fixed for given data, the only stochastic part is the bootstrap sample in the \(m\)-th tree given training data from the data-generating process. We could write this as \(\mathbb{E}_{\mathcal{D}_{\text{train}}^{[m]} ~|~ \mathcal{D}_{\text{train}} \sim \mathbb{P}_{xy}}\). In this exercise we’ll omit the subscript for ease of notation.
- Back to our proof, which makes use of the variance identity from the hint.
- From this identity, we know that \(\mathbb{E}(Z^2) \geq (\mathbb{E}(Z))^2\) for any RV \(Z\).
- Applying that to our LHS, we obtain \[ \mathbb{E}\left( \left(y - b^{[m]}(\mathbf{x})\right)^2\right) \geq \left(\mathbb{E}\left( y - b^{[m]}(\mathbf{x})\right) \right)^2 . \]
- Since \((\mathbf{x}, y)\) is fixed, \(y\) is a constant inside the expectation; by linearity of expectation, the expectation only needs to act on the actually stochastic part \(b^{[m]}(\mathbf{x})\), yielding: \[ \mathbb{E}\left( \left(y - b^{[m]}(\mathbf{x})\right)^2\right) \geq \left( y - \mathbb{E}\left(b^{[m]}(\mathbf{x})\right) \right)^2 . \]
- The last missing step is to show that \(\mathbb{E}\left(b^{[m]}(\mathbf{x})\right) = f^{[M]}(\mathbf{x})\). To compute the expectation for this discrete random variable (we have a finite ensemble), we sum over all possible realizations, weighted by their probability of occurrence. This can be simplified further because each of the \(M\) bootstrap samples is drawn with equal probability:
\[ \mathbb{E}\left(b^{[m]}(\mathbf{x})\right) = \sum_{m = 1}^M b^{[m]}(\mathbf{x})p \left(\mathcal{D}_{\text{train}}^{[m]} \right) = \frac{1}{M} \sum_{m = 1}^M b^{[m]}(\mathbf{x}), \]
which is precisely the ensemble prediction \(f^{[M]}(\mathbf{x})\).
Putting everything together, we get
\[ \mathbb{E}\left( \left(y - b^{[m]}(\mathbf{x})\right)^2\right) \geq \left( y - \left(f^{[M]}(\mathbf{x})\right) \right)^2, \]
showing that the expected quadratic loss over individual base learner predictions is at least as large as the loss of the ensemble prediction.
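The inequality can also be checked empirically. Below is a minimal sketch in the Python track (sklearn): we bag regression trees on bootstrap samples of synthetic data, then compare the average individual squared loss at one fixed observation with the squared loss of the ensemble prediction. All data and parameter choices here are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_star, y_star = X[0], y[0]  # the fixed observation (x, y)

# Train M base learners, each on its own bootstrap sample
M = 50
preds = []
for m in range(M):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: draw n with replacement
    tree = DecisionTreeRegressor(random_state=m).fit(X[idx], y[idx])
    preds.append(tree.predict(x_star.reshape(1, -1))[0])
preds = np.array(preds)

lhs = np.mean((y_star - preds) ** 2)  # expected loss over base learners
rhs = (y_star - preds.mean()) ** 2    # loss of the ensemble prediction f^[M]
print(lhs >= rhs)                     # lhs - rhs equals Var(b(x)) >= 0
```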
Exercise 2: Classifying spam
Take a look at the `spam` dataset and briefly describe what kind of classification problem this is.

Hint

Access the corresponding task via `?mlr3::mlr_tasks_spam` (R), resp. read `spam.csv` (Python).
Solution
The spam data is a binary classification task where the aim is to classify an e-mail as spam or non-spam:
Load data
Inspect
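In the Python track, the inspection step might look like the following sketch. The tiny DataFrame here is a hypothetical stand-in; the actual spam data comprises 4601 e-mails, 57 numeric features, and a binary target with levels spam/nonspam.

```python
import pandas as pd

# Hypothetical stand-in for the spam data; column names are illustrative
df = pd.DataFrame({
    "charDollar": [0.00, 0.18, 0.00, 0.32],
    "capitalLong": [3.0, 61.0, 2.0, 15.0],
    "type": ["nonspam", "spam", "nonspam", "spam"],
})
print(df.shape)                    # (n observations, n columns)
print(df["type"].value_counts())  # two classes -> binary classification
```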
Use a decision tree to predict `spam`. Re-fit the tree using two random subsets of the data (each comprising 60% of observations). How stable are the trees?

Hint

In R, use `rpart.plot()` from the package `rpart.plot` to visualize the trees. In Python, use `from sklearn.tree import plot_tree`.
Solution
Full dataset
Data subsets
Observation: trees trained on different samples differ considerably in their structure, regarding split variables as well as thresholds (recall, though, that the split candidates are a further source of randomness).
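The instability can be reproduced in the Python track as well. A minimal sketch on synthetic stand-in data (all names and sizes are illustrative): fit a tree on two random 60% subsets and compare, e.g., the feature chosen for the root split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam data (hypothetical)
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
rng = np.random.default_rng(1)

roots = []
for subset in range(2):
    idx = rng.choice(len(X), size=int(0.6 * len(X)), replace=False)  # 60% subset
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
    roots.append(int(tree.tree_.feature[0]))  # index of the root split variable

print(roots)  # differing indices indicate structurally different trees
```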
- Forests come with a built-in estimate of their generalization ability via the out-of-bag (OOB) error.
- Show that the probability for an observation to be OOB in an arbitrary bootstrap sample converges to \(\tfrac{1}{e}\).
Solution
This requires a little trick.

- First, think about the probability of an observation being OOB in a single tree. Imagine the tree’s bootstrap sample has \(n\) free spots, to be filled from the training observations.
- For each spot, the probability of our observation being drawn is \(\tfrac{1}{n}\) (all observations are equally likely to be selected). Conversely, the probability of not being drawn is \(1 - \tfrac{1}{n}\).
- We draw with replacement, meaning the events of filling the spots are all independent. The probability of not being drawn for any of the \(n\) spots – i.e., not ending up in the tree’s bootstrap sample and thus being OOB – is therefore \(\left(1- \tfrac{1}{n}\right)^n\).
- You can imagine that this probability is lower if we only have a few observations; it converges to a fixed value for larger datasets.
- Since we’re interested in a general statement, we look for this stable value, taking \(n\) to the limit: \(\lim_{n \to \infty} \left( 1 - \tfrac{1}{n} \right)^n\).
- Now comes the trick: if you’re well-versed in analysis, you might recognize this expression as one way to characterize the exponential function. For an arbitrary input \(x\), we have \(e^x = \lim_{n \to \infty} \left( 1 + \tfrac{x}{n} \right)^n\).
- Our probability is exactly the exponential function evaluated at \(x = -1\), resulting in \[ \lim_{n \to \infty} \left( 1 - \tfrac{1}{n} \right)^n = e^{-1} = \tfrac{1}{e} \approx 0.37. \]
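A quick numerical check of this limit (plain Python; the sample sizes are arbitrary):

```python
import math

# (1 - 1/n)^n approaches 1/e as n grows
for n in (10, 100, 10_000, 1_000_000):
    print(n, (1 - 1 / n) ** n)

print("1/e =", 1 / math.e)  # approximately 0.3679
```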
- Use the random forest learner (R: `classif.ranger`, Python: `RandomForestClassifier()`) to fit the model and state the out-of-bag (OOB) error.
Solution
The OOB error can be computed by:
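In the Python track, a minimal sketch on synthetic stand-in data (the real exercise uses the spam task): with `oob_score=True`, sklearn evaluates each observation only with trees whose bootstrap sample did not contain it, and the OOB error is one minus the reported OOB accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical sizes)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# oob_score=True: each observation is predicted only by trees
# for which it was out-of-bag
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1 - rf.oob_score_
print(f"OOB error: {oob_error:.3f}")
```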
You are interested in which variables have the greatest influence on the prediction quality. Explain how to determine this with a permutation-based approach and compute the importance scores for the data.
Solution
Variable importance in general measures the contributions of features to a model. One way of computing the variable importance of the \(j\)-th variable is based on permuting it for the OOB observations and calculating the mean increase in OOB error this permutation entails.
In order to determine the variables with the biggest influence on prediction quality, we can choose the \(k\) variables with the highest importance score, e.g., for \(k = 5\):
Numerical scores
Visual representation
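In the Python track, a sketch of the permutation-based computation on synthetic stand-in data. Note one assumption: unlike ranger’s OOB-based permutation importance, sklearn’s `permutation_importance` here evaluates on whatever data it is given (ideally held-out data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (hypothetical); the idea carries over to the spam task
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permute each feature and record the mean drop in accuracy it causes
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
top5 = np.argsort(result.importances_mean)[::-1][:5]  # k = 5 most important
print(top5, result.importances_mean[top5])
```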
Exercise 3: Proximities
You solve the wine task, predicting the type of a wine – with 3 classes – from a number of covariates. After training, you wish to determine how similar your observations are in terms of proximities.
The model information was created with ranger::treeInfo(), which assigns observations with values larger than splitval to the right child node in each split.
| observation | alcalinity | alcohol | flavanoids | hue | malic | phenols |
|---|---|---|---|---|---|---|
| 1 | 11.4 | 14.75 | 3.69 | 1.25 | 1.73 | 3.10 |
| 2 | 25.0 | 13.40 | 0.96 | 0.67 | 4.60 | 1.98 |
| 3 | 17.4 | 13.94 | 3.54 | 1.12 | 1.73 | 2.88 |
[1] "Tree 1:"
| nodeID | leftChild | rightChild | splitvarID | splitvarName | splitval | terminal | prediction |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | alcohol | 12.70 | FALSE | NA |
| 1 | 3 | 4 | 2 | flavanoids | 0.97 | FALSE | NA |
| 2 | 5 | 6 | 2 | flavanoids | 1.58 | FALSE | NA |
| 3 | NA | NA | NA | NA | NA | TRUE | 3 |
| 4 | NA | NA | NA | NA | NA | TRUE | 2 |
| 5 | NA | NA | NA | NA | NA | TRUE | 3 |
| 6 | NA | NA | NA | NA | NA | TRUE | 1 |
[1] "Tree 2:"
| nodeID | leftChild | rightChild | splitvarID | splitvarName | splitval | terminal | prediction |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2 | flavanoids | 2.31 | FALSE | NA |
| 1 | 3 | 4 | 5 | phenols | 1.84 | FALSE | NA |
| 2 | 5 | 6 | 3 | hue | 0.86 | FALSE | NA |
| 3 | NA | NA | NA | NA | NA | TRUE | 3 |
| 4 | NA | NA | NA | NA | NA | TRUE | 2 |
| 5 | NA | NA | NA | NA | NA | TRUE | 2 |
| 6 | NA | NA | NA | NA | NA | TRUE | 1 |
[1] "Tree 3:"
| nodeID | leftChild | rightChild | splitvarID | splitvarName | splitval | terminal | prediction |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | alcohol | 12.78 | FALSE | NA |
| 1 | 3 | 4 | 2 | flavanoids | 1.23 | FALSE | NA |
| 2 | 5 | 6 | 2 | flavanoids | 1.49 | FALSE | NA |
| 3 | NA | NA | NA | NA | NA | TRUE | 3 |
| 4 | NA | NA | NA | NA | NA | TRUE | 2 |
| 5 | NA | NA | NA | NA | NA | TRUE | 3 |
| 6 | NA | NA | NA | NA | NA | TRUE | 1 |
For the following subset of the training data and the random forest model given above,
- find the terminal node of each tree the observations are placed in,
Solution
Using the treeInfo() output, we can follow the path of each sample through each tree. The following table shows for each observation (rows) its terminal node as assigned by trees 1-3. For example, consider observation 1 in tree 1 (first cell): the observation has alcohol \(> 12.70\), putting it in node 2 (rightChild of node 0), and from there in node 6 (because it has flavanoids \(> 1.58\)).

| observation | tree 1 | tree 2 | tree 3 |
|---|---|---|---|
| 1 | 6 | 6 | 6 |
| 2 | 5 | 4 | 5 |
| 3 | 6 | 6 | 6 |
- compute the observations’ pairwise proximities, and
Solution
For the proximities, we consider each pair of observations and compute the relative frequency of trees assigning them to the same terminal node.
Observations 1 and 2: no tree assigns them to the same terminal node, so the proximity is 0.
Observations 1 and 3: all trees assign them to the same node, so the proximity is 1.
Observations 2 and 3: no tree assigns them to the same terminal node, so the proximity is 0.
- construct a similarity matrix from these proximities in
Rresp.Python.
Solution
We can put this information into a similarity matrix (as such matrices become large quite quickly for more data, it is common to store only the lower triangle – the matrix is symmetric, so the rest is redundant):
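The by-hand computation can be mirrored in the Python track. The sketch below encodes the three trees from the treeInfo tables as plain dictionaries (an illustrative encoding, not ranger’s API), routes each observation from the data table through them using the rule that values larger than splitval go right, and derives the proximity matrix.

```python
import numpy as np

# Observations from the data table above; column order (0-indexed, as in
# splitvarID): alcalinity, alcohol, flavanoids, hue, malic, phenols
X = np.array([
    [11.4, 14.75, 3.69, 1.25, 1.73, 3.10],  # observation 1
    [25.0, 13.40, 0.96, 0.67, 4.60, 1.98],  # observation 2
    [17.4, 13.94, 3.54, 1.12, 1.73, 2.88],  # observation 3
])

# Each tree maps an inner nodeID to (leftChild, rightChild, splitvarID, splitval);
# terminal nodes are simply absent from the dict
trees = [
    {0: (1, 2, 1, 12.70), 1: (3, 4, 2, 0.97), 2: (5, 6, 2, 1.58)},  # tree 1
    {0: (1, 2, 2, 2.31), 1: (3, 4, 5, 1.84), 2: (5, 6, 3, 0.86)},   # tree 2
    {0: (1, 2, 1, 12.78), 1: (3, 4, 2, 1.23), 2: (5, 6, 2, 1.49)},  # tree 3
]

def terminal_node(tree, x):
    node = 0
    while node in tree:  # descend until we leave the inner nodes
        left, right, var, val = tree[node]
        node = right if x[var] > val else left  # > splitval goes right (ranger)
    return node

leaves = np.array([[terminal_node(t, x) for t in trees] for x in X])
prox = np.array([[np.mean(leaves[i] == leaves[j]) for j in range(3)]
                 for i in range(3)])  # share of trees with identical leaf
print(leaves)
print(prox)
```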